Abstract. We consider real-time games where the goal consists, for each player,
in maximizing the average reward he or she receives per time unit. We consider
zero-sum rewards, so that a reward of $+r$ to one player corresponds to a reward
of $-r$ to the other player. The games are played on discrete-time game structures
which can be specified using a two-player version of timed automata whose
locations are labeled by reward rates. Even though the rewards themselves are
zero-sum, the games are not, due to the requirement that time must progress along a
play of the game.

Since we focus on control applications, we define the value of the game to a
player to be the maximal average reward per time unit that the player can ensure.
We show that, in general, the values to players 1 and 2 do not sum to zero. We
provide algorithms for computing the value of the game for either player; the
algorithms are based on the relationship between the original, infinite-round game,
and a derived game that is played for only finitely many rounds. As memoryless
optimal strategies exist for both players in both games, we show that the problem
of computing the value of the game is in NP $\cap$ coNP.


1  Introduction 

Games provide a setting for the study of control problems. It is natural to view a system
and its controller as two players in a game; the problem of synthesizing a controller
given a control goal can be phrased as the problem of finding a controller strategy
that enforces the goal, regardless of how the system behaves [Chu63,RW89,PR89].
In the control of real-time systems, the games must not only model the interaction
steps between the system and the controller, but also the amount of time that
elapses between these steps. This leads to timed games, a model that was first
applied to the synthesis of controllers for safety, reachability, and other $\omega$-regular goals
[MPS95,AH97,AMAS98,HHM99,dAFH+03]. More recently, the problem of designing
controllers for efficiency goals has been addressed, via the consideration of priced
versions of timed games [BCFL04,ABM04]. In priced timed games, price rates (or,
symmetrically, reward rates) are associated with the states of the game, and prices (or
rewards) with its transitions. The problem that has so far been addressed is the
synthesis of minimum-cost controllers for reachability goals [BCFL04,ABM04]. In this paper,
we focus instead on the problem of synthesizing controllers that maximize the average
reward$^1$ per time unit accrued along an infinite play of the game. This is an expressive
and widely applicable efficiency goal, since many real-time systems are modeled as
non-terminating systems which exhibit infinite behaviors.

by the ONR grant N00014-02-1-0671, and by the ARP award TO.030.MM.D.

Fig. 1. A game automaton where player 1 can freeze time to achieve a higher average reward.

We consider timed games played between two players over discrete-time game
structures with finite state space. At each round, both players independently choose
a move. We distinguish between immediate moves, which correspond to control actions
or system transitions and take 0 time, and timed moves. There are two timed moves:
the move $\Delta_0$, which signifies the intention to wait for 0 time, and the move $\Delta_1$, which
signifies the intention of waiting for 1 time unit. The two moves chosen by the players
jointly determine the successor state: roughly, immediate moves take precedence
over timed ones, and unit-length time steps occur only when both players play $\Delta_1$. Each
state is associated with a reward rate, which specifies the reward obtained when staying
at the state for one time unit. We consider zero-sum rewards, so that a reward of $+r$ to
one player corresponds to a reward of $-r$ to the other player. These game structures can
be specified using a notation similar to that of timed automata. Each location is labeled
by a reward rate, and by two invariants (rather than one), which specify how long the
two players can stay at the location; the actions labeling the edges correspond to the
immediate moves of the players.

The goal of each player is to maximize the long-run average reward it receives per
time unit; however, this goal is subordinate to the requirement that players should not
block the progress of time by playing zero-delay moves (immediate moves,
or $\Delta_0$) forever. As an example, consider the game of Figure 1. The strategy that maximizes the
reward per time unit calls for player 1 staying forever at $q_0$: this yields an average reward
per time unit of 4. However, such a strategy would block time, since the clock $x$ would
not be able to increase beyond the value 2, due to the player-1 invariant $x \le 2$ at $q_0$. If
player 1 plays move $a^1$, time can progress, but the average reward per time unit is 1. To
prevent players from blocking time in their pursuit of higher average reward, we define
the value of a play of the game in a way that enforces time progress. If time diverges
along the play, the value of the play is the average reward per time unit obtained along
it. If time does not diverge along the play, there are two cases. If a player contributes
to blocking the progress of time, then the value of the play to the player is $-\infty$; if
the progress of time is blocked entirely by the other player, then the value of the play
to the player is $+\infty$. These definitions are based on the treatment of time divergence
in timed games of [dAFH+03,dAHS02]. According to these definitions, even though
the reward rate is zero-sum, and time-divergent plays have zero-sum values, the games
are not zero-sum, due to the treatment of time divergence. Since we are interested in
the problem of controller design, we define the value of a game to a player to be the
maximal play value that the player is able to secure, regardless of how the adversary
plays. The resulting games are not determined: that is, the values that the two players can
secure do not sum to zero. We show that there is no symmetrical formulation that can
at the same time enforce time progress and lead to a determined setting.

$^1$ With a sign change, this is obviously equivalent to minimizing the average cost.

We provide algorithms for computing the value of the game for either player. The
algorithms are based on the relationship between the original, infinite-round game, and
a derived game that is played on the same discrete-time game structure, but for only
finitely many rounds. As in [EM79], the derived game terminates whenever one of the
two players closes a loop; our construction, however, differs from [EM79] in how it
assigns a value to the loops, due to our different notion of value of a play. We show that
a player can achieve the same value in the finite game as in the original infinite-round
game. Our proof is inspired by the argument in [EM79], and it closes some small gaps
in the proof of [EM79].

The equivalence between finite and infinite games provides a PSPACE algorithm
for computing the value of average reward discrete-time games. We improve this
result by showing that both finite and infinite games admit memoryless optimal strategies
for each player. Once we fix a memoryless strategy for a player, the game is reduced
to a graph. We provide a polynomial-time algorithm that enables the computation of
the value of the graph for the other player. The algorithm is based on polynomial-time
graph transformations, followed by the application of Karp’s algorithm for computing
the minimum/maximum average cost of a cycle [Kar78]. The existence of memoryless
strategies, together with this algorithm, provides us with a polynomial witness and with
a polynomial-time algorithm for checking the witness. Since this analysis can be done
both for the winning strategies of a player, and for the “spoiling” strategies of the
opponent, we conclude that the problem of computing the value of an average-reward timed
game, for either player, is in NP $\cap$ coNP. This matches the best known bounds for several
other classes of games, among which are turn-based deterministic parity games [EJ91]
and turn-based stochastic reachability games [Con92]. Since the maximum average
reward accumulated in the first $n$ time units cannot be computed by iterating $n$ times a
dynamic-programming operator, the weakly-polynomial algorithm of [ZP96] cannot be
adapted to our games; the existence of polynomial algorithms is an open problem.
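
As an illustration of the last step of the graph algorithm mentioned above, the following is a minimal sketch of Karp's minimum mean cycle computation [Kar78] on a plain weighted digraph. It does not include the graph transformations of this paper. The function name, the edge-list encoding, and the choice to start walks from every vertex at once (instead of a single source) are assumptions made for the example.

```python
from fractions import Fraction


def karp_min_mean_cycle(n, edges):
    """Minimum mean weight of a cycle in a digraph, or None if it is acyclic.

    n     -- number of vertices, named 0 .. n-1
    edges -- list of (u, v, w) triples with integer weights w
    """
    inf = float("inf")
    # d[k][v]: minimum weight of a walk with exactly k edges that ends at v.
    # Walks may start anywhere, which plays the role of a virtual source.
    d = [[inf] * n for _ in range(n + 1)]
    for v in range(n):
        d[0][v] = 0
    for k in range(1, n + 1):
        for u, v, w in edges:
            if d[k - 1][u] + w < d[k][v]:
                d[k][v] = d[k - 1][u] + w

    best = None
    for v in range(n):
        if d[n][v] == inf:
            continue  # no walk with n edges ends at v
        # Karp's formula: max over k of (d_n(v) - d_k(v)) / (n - k).
        worst = max(
            Fraction(d[n][v] - d[k][v], n - k)
            for k in range(n)
            if d[k][v] != inf
        )
        if best is None or worst < best:
            best = worst
    return best
```

The maximum mean cycle can be obtained with the same function by negating all weights and negating the result.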

The goal of minimizing the long-run average cost incurred during the life of a
real-time system has been considered previously in [BBL04]. There, the underlying model is
a timed automaton, and the paper solves the verification problem (“what is the minimum
long-run average cost achievable?”), or equivalently, the control problem for a fully
deterministic system. In contrast, the underlying computational model in this paper is
a timed game, and the problem solved is the control of a nondeterministic real-time
system.

Compared to other work on priced timed games [BCFL04,ABM04], our models for
timed games are simplified in two ways. First, rewards can only be accrued by staying
at a state, and not by taking transitions. Second, we study the problem in discrete time.
On the other hand, our models are more general in that, unlike [BCFL04,ABM04], we
do not impose structural constraints on the game structures that ensure the progress of
time. There is a tradeoff between imposing structural constraints and allowing rewards
for transitions: had we introduced constraints that ensure time progress, we could have
easily accommodated rewards on the transitions. The restriction to discrete time
somewhat limits the expressiveness of the models. Nevertheless, control problems where
the control actions can be issued only at discrete points in time are very common: most
real controllers are driven by a periodic clock; hence, the discrete-time restriction is not
unduly limiting as far as the controller actions are concerned. We note that there are
also many cases where the system actions can be considered to occur in discrete time:
this is the case, for instance, whenever the state of the system is sampled regularly in
time.

2  Discrete-Time  Game  Structures 

We define discrete-time game structures as a discrete-time version of the timed
game structures of [dAFH+03]. A discrete-time game structure represents a game
between two players, which we denote by 1, 2; we indicate by $\sim i$ the opponent
of $i \in \{1,2\}$ (that is, player $3 - i$). A discrete-time game structure is a tuple
$\mathcal{G} = (S, Acts_1, Acts_2, \Gamma_1, \Gamma_2, \delta, r)$, where:

- $S$ is a finite set of states.

- $Acts_1$ and $Acts_2$ are two disjoint sets of actions for player 1 and player 2,
respectively. We assume that $\Delta_0, \Delta_1 \notin Acts_i$ and write $M_i = Acts_i \cup \{\Delta_0, \Delta_1\}$ for the sets
of moves of player $i \in \{1,2\}$.

- For $i \in \{1,2\}$, the function $\Gamma_i : S \mapsto 2^{M_i} \setminus \emptyset$ is an enabling condition, which assigns
to each state $s$ a set $\Gamma_i(s)$ of moves available to player $i$ in that state.

- $\delta : S \times (M_1 \cup M_2) \mapsto S$ is a destination function that, given a state and a move of
either player, determines the next state in the game.

- $r : S \mapsto \mathbb{Z}$ is a function that associates with each state $s \in S$ the reward rate of $s$: this
is the reward that player 1 earns for staying for one time unit at $s$.

The move $\Delta_0$ represents an always-enabled stuttering move that takes 0 time: we require
that for $s \in S$ and $i \in \{1,2\}$, we have $\Delta_0 \in \Gamma_i(s)$ and $\delta(s,\Delta_0) = s$. The moves in
$\{\Delta_0\} \cup Acts_1 \cup Acts_2$ are known as the zero-time moves. The move $\Delta_1$ represents the decision
of waiting for 1 time unit. We do not require that $\Delta_1$ be always enabled: if we have
$\Delta_1 \notin \Gamma_i(s)$ for player $i \in \{1,2\}$ at a state $s \in S$, then player $i$ cannot wait, but must
immediately play a zero-time move. We define the size of a discrete-time game structure
by $|\mathcal{G}| = \sum_{s \in S} (|\Gamma_1(s)| + |\Gamma_2(s)|)$.
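
The definition above can be sketched as a small data structure. This is only an illustrative encoding under stated assumptions: the class name `GameStructure`, the dictionary-based representation, and the string stand-ins `"D0"` and `"D1"` for the two timed moves are all hypothetical choices, not part of the paper.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the timed moves Delta_0 and Delta_1.
D0, D1 = "D0", "D1"


@dataclass
class GameStructure:
    states: set
    acts1: set    # actions of player 1
    acts2: set    # actions of player 2, disjoint from acts1
    gamma1: dict  # state -> set of moves enabled for player 1
    gamma2: dict  # state -> set of moves enabled for player 2
    delta: dict   # (state, move) -> next state
    reward: dict  # state -> integer reward rate

    def check(self):
        """Check the requirements stated for Delta_0 and for the action sets."""
        assert not (self.acts1 & self.acts2), "action sets must be disjoint"
        for s in self.states:
            for gamma in (self.gamma1, self.gamma2):
                assert D0 in gamma[s], "Delta_0 must always be enabled"
            assert self.delta[(s, D0)] == s, "Delta_0 must stutter"

    def size(self):
        """Sum over all states of the number of enabled moves of both players."""
        return sum(len(self.gamma1[s]) + len(self.gamma2[s]) for s in self.states)
```

Storing the enabling conditions per state keeps the size computation a direct transcription of the formula for $|\mathcal{G}|$.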

2.1  Move  Outcomes,  Runs,  and  Strategies 

A timed game proceeds as follows. At each state $s \in S$, player 1 chooses a move
$a^1 \in \Gamma_1(s)$, and simultaneously and independently, player 2 chooses a move $a^2 \in \Gamma_2(s)$. The
set of successor states $\delta(s,a^1,a^2) \subseteq S$ is then determined according to the following
rules.

- Actions take precedence over stutter steps and time steps. If $a^1 \in Acts_1$ or
$a^2 \in Acts_2$, then the game takes an action $a$ selected nondeterministically from
$A = \{a^1,a^2\} \cap (Acts_1 \cup Acts_2)$, and $\delta(s,a^1,a^2) = \{\delta(s,a) \mid a \in A\}$.

- Stutter steps take precedence over time steps. If $a^1, a^2 \in \{\Delta_0,\Delta_1\}$, there are two
cases.

• If $a^1 = \Delta_0$ or $a^2 = \Delta_0$, the game performs a stutter step, and $\delta(s,a^1,a^2) = \{s\}$.
• If $a^1 = a^2 = \Delta_1$, then the game performs a time step of duration 1, and the game
proceeds to $\delta(s,a^1,a^2) = \{\delta(s,\Delta_1)\}$.
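
The precedence rules above can be sketched as a single function. The function name `successors`, the dictionary encoding of the destination function, and the strings `"D0"` and `"D1"` standing for the two timed moves are assumptions made for the example.

```python
# Hypothetical stand-ins for the timed moves Delta_0 and Delta_1.
D0, D1 = "D0", "D1"


def successors(delta, acts1, acts2, s, a1, a2):
    """Set of successor states for the joint move (a1, a2) played at state s.

    delta maps (state, move) pairs to states; acts1 and acts2 are the
    action sets of player 1 and player 2.
    """
    # Actions take precedence over stutter steps and time steps.
    proposed = {a for a in (a1, a2) if a in acts1 or a in acts2}
    if proposed:
        return {delta[(s, a)] for a in proposed}
    # Both moves are timed: a stutter step wins over a time step.
    if a1 == D0 or a2 == D0:
        return {s}
    # Both players wait, so one time unit elapses.
    return {delta[(s, D1)]}
```

The result is a set because, when both players propose an action, either action may be the one carried out.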

An infinite run (or simply run) of the discrete-time game structure $\mathcal{G}$ is a sequence
$s_0, (a_1^1,a_1^2), s_1, (a_2^1,a_2^2), s_2, \ldots$ such that $s_k \in S$, $a^1_{k+1} \in \Gamma_1(s_k)$, $a^2_{k+1} \in \Gamma_2(s_k)$, and
$s_{k+1} \in \delta(s_k,a^1_{k+1},a^2_{k+1})$ for all $k \ge 0$. A finite run $\sigma$ is a finite prefix of a run that terminates
at a state $s$; we then set $last(\sigma) = s$. We denote by $FRuns$ the set of all finite runs of
the game structure, and by $Runs$ the set of its infinite runs. For a finite or infinite run $\sigma$,
and a number $k < |\sigma|$, we denote by $\sigma_{\le k}$ the prefix of $\sigma$ up to and including state $\sigma_k$. A
state $s'$ is reachable from another state $s$ if there exists a finite run $s_0, (a_1^1,a_1^2), s_1, \ldots, s_n$
such that $s_0 = s$ and $s_n = s'$.

A strategy $\pi_i$ for player $i \in \{1,2\}$ is a mapping $\pi_i : FRuns \mapsto M_i$ that
associates with each finite run $s_0, (a_1^1,a_1^2), s_1, \ldots, s_n$ the move $\pi_i(s_0, (a_1^1,a_1^2), s_1, \ldots, s_n)$ to
be played at $s_n$. We require that the strategy only selects enabled moves, that is,
$\pi_i(\sigma) \in \Gamma_i(last(\sigma))$ for all $\sigma \in FRuns$. For $i \in \{1,2\}$, let $\Pi_i$ denote the set of all player
$i$ strategies. A strategy $\pi_i$ for player $i \in \{1,2\}$ is memoryless if for all $\sigma, \sigma' \in FRuns$ we
have that $last(\sigma) = last(\sigma')$ implies $\pi_i(\sigma) = \pi_i(\sigma')$. For strategies $\pi_1 \in \Pi_1$ and $\pi_2 \in \Pi_2$,
we say that a run $s_0, (a_1^1,a_1^2), s_1, \ldots$ is consistent with $\pi_1$ and $\pi_2$ if, for all $n \ge 0$ and
$i = 1,2$, we have $\pi_i(s_0, (a_1^1,a_1^2), s_1, \ldots, s_n) = a^i_{n+1}$. We denote by $Outcomes(s,\pi_1,\pi_2)$
the set of all runs that start in $s$ and are consistent with $\pi_1, \pi_2$. Note that in our timed
games, two strategies and a start state yield a set of outcomes, because if the players
both propose actions, a nondeterministic choice between the two moves is made.
According to this definition, strategies can base their choices on the entire history of the
game, consisting of both past states and moves.

2.2  Discrete-Time  Game  Automata 

We specify discrete-time game structures via discrete-time game automata, which are
a discrete-time version of the timed automaton games of [dAFH+03]; both models are
two-player versions of timed automata [AD94]. A clock condition over a set $C$ of clocks
is a boolean combination of formulas of the form $x \triangleleft c$ or $x - y \triangleleft c$, where $c$ is an integer,
$x, y \in C$, and $\triangleleft$ is either $<$ or $\le$. We denote the set of all clock conditions over $C$ by
$ClkConds(C)$. A clock valuation is a function $\kappa : C \mapsto \mathbb{R}_{\ge 0}$, and we denote by $K(C)$ the
set of all clock valuations for $C$.

A discrete-time game automaton is a tuple
$\mathcal{A} = (Q, C, Acts_1, Acts_2, E, \theta, \rho, Inv_1, Inv_2, Rew)$, where:

- $Q$ is a finite set of locations.

- $C$ is a finite set of clocks.

- $Acts_1$ and $Acts_2$ are two disjoint, finite sets of actions for player 1 and player 2,
respectively.

- $E \subseteq Q \times (Acts_1 \cup Acts_2) \times Q$ is an edge relation.

- $\theta : E \mapsto ClkConds(C)$ is a mapping that associates with each edge a clock
condition that specifies when the edge can be traversed. We require that for all
$(q,a,q_1), (q,a,q_2) \in E$ with $q_1 \ne q_2$, the conjunction $\theta(q,a,q_1) \wedge \theta(q,a,q_2)$ is
unsatisfiable. In other words, the game move and clock values determine uniquely the
successor location.

- $\rho : E \mapsto 2^C$ is a mapping that associates with each edge the set of clocks to be reset
when the edge is traversed.

- $Inv_1, Inv_2 : Q \mapsto ClkConds(C)$ are two functions that associate with each location
an invariant for player 1 and 2, respectively.

- $Rew : Q \mapsto \mathbb{Z}$ is a function that assigns a reward $Rew(q) \in \mathbb{Z}$ to each $q \in Q$.

Given a clock valuation $\kappa : C \mapsto \mathbb{R}_{\ge 0}$, we denote by $\kappa + 1$ the valuation defined by
$(\kappa + 1)(x) = \kappa(x) + 1$ for all clocks $x \in C$. The clock valuation $\kappa : C \mapsto \mathbb{R}_{\ge 0}$ satisfies
the clock constraint $\alpha \in ClkConds(C)$, written $\kappa \models \alpha$, if $\alpha$ holds when the clocks have
the values specified by $\kappa$. For a subset $C' \subseteq C$ of clocks, $\kappa[C' := 0]$ denotes the valuation
defined by $\kappa[C' := 0](x) = 0$ if $x \in C'$, and by $\kappa[C' := 0](x) = \kappa(x)$ otherwise.

The discrete-time game automaton $\mathcal{A}$ induces a discrete-time game structure $[[\mathcal{A}]]$,
whose states consist of a location of $\mathcal{A}$ and a clock valuation over $C$. The idea is the
following. The move $\Delta_0$ is always enabled at all states $(q,\kappa)$, and leads again to $(q,\kappa)$.
The move $\Delta_1$ is enabled for player $i \in \{1,2\}$ at state $(q,\kappa)$ if $\kappa + 1 \models Inv_i(q)$; the move
leads to state $(q,\kappa + 1)$. For player $i \in \{1,2\}$ and $a \in Acts_i$, the move $a$ is enabled at
a state $(q,\kappa)$ if there is a transition $(q,a,q')$ in $E$ which is enabled at $(q,\kappa)$, and if the
invariant $Inv_i(q')$ holds for the destination state $(q', \kappa[\rho(q,a,q') := 0])$. If the values of
the clocks can grow unboundedly, this translation would yield an infinite-state
discrete-time game structure. However, we can define clock regions similarly to timed automata
[AD94], and we can include in the discrete-time game structure only one state per clock
region; as usual, this leads to a finite state space.
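
The valuation operations and the enabling rule for the time move can be sketched as follows. The encoding is an assumption made for the example: valuations are dictionaries from clock names to integers, invariants are Python predicates on valuations, and the helper names `advance`, `reset`, and `delta1_enabled` are hypothetical. The invariant `x <= 2` in the usage note is likewise an assumed example.

```python
def advance(kappa):
    """The valuation kappa + 1: every clock grows by one time unit."""
    return {x: v + 1 for x, v in kappa.items()}


def reset(kappa, clocks):
    """The valuation kappa[C' := 0]: clocks in C' are set to 0, others kept."""
    return {x: (0 if x in clocks else v) for x, v in kappa.items()}


def delta1_enabled(inv_i, q, kappa):
    """Player i may wait at (q, kappa) if kappa + 1 satisfies Inv_i(q).

    inv_i maps each location to a predicate over valuations (an assumed
    encoding of clock conditions, chosen for brevity).
    """
    return inv_i[q](advance(kappa))
```

For instance, with `inv1 = {"q0": lambda k: k["x"] <= 2}`, waiting is enabled at `{"x": 1}` and disabled at `{"x": 2}`.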


3  The  Average  Reward  Condition 

In this section, we consider a discrete-time game structure
$\mathcal{G} = (S, Acts_1, Acts_2, \Gamma_1, \Gamma_2, \delta, r)$, unless otherwise noted.


3.1  The  Value  of  a  Game 


We consider games where the goal for player 1 consists in maximizing the
average reward per time unit obtained along a game outcome. The goal for player 2
is symmetrical, and it consists in minimizing the average reward per time unit
obtained along a game outcome. To make these goals precise, consider a finite run
$\sigma = \sigma_0, (\sigma_1^1,\sigma_1^2), \sigma_1, \ldots, \sigma_n$. For $k \ge 1$, the time $D_k$ elapsed at step $k$ of the run is
defined by $D_k(\sigma) = 1$ if $\sigma_k^1 = \sigma_k^2 = \Delta_1$, and $D_k(\sigma) = 0$ otherwise; the reward $R_k$ accrued
at step $k$ of the run is given by $R_k(\sigma) = r(\sigma_{k-1}) \cdot D_k(\sigma)$. The time elapsed during $\sigma$ and
the reward achieved during $\sigma$ are defined in the obvious way, by $D(\sigma) = \sum_{k=1}^{n} D_k(\sigma)$
and $R(\sigma) = \sum_{k=1}^{n} R_k(\sigma)$. Finally, we define the long-run average reward of an infinite
run $\sigma$ by:

$$\bar{r}(\sigma) = \liminf_{n \to \infty} \frac{R(\sigma_{\le n})}{D(\sigma_{\le n})}.$$
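
The quantities $D$ and $R$ of a finite run can be computed directly from their definitions. The sketch below assumes a run is given as a list of states and a list of move pairs, with the strings `"D0"` and `"D1"` standing for the timed moves; these encodings and the function names are hypothetical. The limit itself is not computed; only the ratio for one finite prefix is.

```python
from fractions import Fraction

# Hypothetical stand-ins for the timed moves Delta_0 and Delta_1.
D0, D1 = "D0", "D1"


def elapsed_and_reward(reward, states, moves):
    """Return (D, R) for the finite run states[0], moves[0], states[1], ...

    states has n + 1 entries, moves has n entries; moves[k - 1] is the pair
    of moves played at step k, from state states[k - 1].
    """
    elapsed = accrued = 0
    for k, (a1, a2) in enumerate(moves, start=1):
        if a1 == D1 and a2 == D1:  # D_k = 1 only when both players wait
            elapsed += 1
            accrued += reward[states[k - 1]]
    return elapsed, accrued


def prefix_average(reward, states, moves):
    """Ratio R / D of the finite run, or None if no time has elapsed."""
    elapsed, accrued = elapsed_and_reward(reward, states, moves)
    return Fraction(accrued, elapsed) if elapsed > 0 else None
```

Returning `None` when no time has elapsed avoids assigning a value to the undefined ratio.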


A first attempt to define the goal of the game consists in asking for the maximum
value of this long-run average reward that player 1 can secure. According to this
approach, the value for player 1 of the game at a state $s$ would be defined by

$$v(\mathcal{G},s) = \sup_{\pi_1 \in \Pi_1} \inf_{\pi_2 \in \Pi_2} \inf\{\bar{r}(\sigma) \mid \sigma \in Outcomes(s,\pi_1,\pi_2)\}.$$

However, this approach fails to take into account the fact that, in timed games, players
must not only play in order to achieve the goal, but must also play realistic strategies
that guarantee the advancement of time. As an example, consider the game of Figure 1.
We have $v((q_0, [x := 0])) = 4$, and the optimal strategy of player 1 consists in staying
at $q_0$ forever, never playing the move $a^1$. Due to the invariant $x \le 2$, such a strategy
blocks the progress of time: once $x = 2$, the only move player 1 can play is $\Delta_0$. It is easy
to see that the only strategies of player 1 that do not block time eventually play move
$a^1$, and have value 1. Note that the game does not contain any blocked states, i.e., from
every reachable state there is a run that is time-divergent: the lack of time progress of
the above-mentioned strategy is due to the fact that player 1 values obtaining a high
average reward more than letting time progress.

To ensure that winning strategies do not block the progress of time, we modify the
definition of value of a run, so that ensuring time divergence has higher priority than
maximizing the average reward. Following [dAFH+03], we introduce the following
predicates:

- For $i \in \{1,2\}$, we denote by $blameless^i(\sigma)$ (“blameless i”) the predicate defined
by $\exists n \ge 0 . \forall k > n . \sigma_k^i = \Delta_1$. Intuitively, $blameless^i(\sigma)$ holds if, along $\sigma$, player $i$
beyond a certain point cannot be blamed for blocking time.

- We denote by $td(\sigma)$ (“time-divergence”) the predicate defined by
$\forall n \ge 0 . \exists k > n . [(\sigma_k^1 = \Delta_1) \wedge (\sigma_k^2 = \Delta_1)]$.

We define the value of a run $\sigma \in Runs$ for player $i \in \{1,2\}$ by:

$$w_i(\sigma) = \begin{cases} +\infty & \text{if } blameless^i(\sigma) \wedge \neg td(\sigma); \\ (-1)^{i+1} \, \bar{r}(\sigma) & \text{if } td(\sigma); \\ -\infty & \text{if } \neg blameless^i(\sigma) \wedge \neg td(\sigma). \end{cases} \qquad (1)$$

It is easy to check that, for each run, exactly one of the three cases of the above definition
applies. Notice that if $td(\sigma)$ holds, then $w_1(\sigma) = -w_2(\sigma)$, so that the value of
time-divergent runs is defined in a zero-sum fashion. We define the value of the game for
player $i$ at $s \in S$ as follows:

$$v_i(\mathcal{G},s) = \sup_{\pi_i \in \Pi_i} \inf_{\pi_{\sim i} \in \Pi_{\sim i}} \inf\{w_i(\sigma) \mid \sigma \in Outcomes(s,\pi_1,\pi_2)\}. \qquad (2)$$

We omit the argument $\mathcal{G}$ from $v_i(\mathcal{G},s)$ when clear from the context.
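
For runs that eventually repeat a fixed cycle forever, the three cases of (1) can be evaluated from the cycle alone, since the finite prefix before the cycle affects neither the predicates nor the limit. The following sketch makes that restriction explicit; the function name `lasso_value`, the list encoding of the cycle, and the strings `"D0"` and `"D1"` for the timed moves are assumptions made for the example.

```python
import math
from fractions import Fraction

# Hypothetical stand-ins for the timed moves Delta_0 and Delta_1.
D0, D1 = "D0", "D1"


def lasso_value(i, reward, cycle_states, cycle_moves):
    """Value w_i of a run that ends by repeating the given cycle forever.

    cycle_moves[k] is the pair of moves played from cycle_states[k];
    i is 1 or 2.
    """
    time_steps = [k for k, (a1, a2) in enumerate(cycle_moves)
                  if a1 == D1 and a2 == D1]
    if time_steps:
        # td holds: the long-run average equals the average over the cycle.
        total = sum(reward[cycle_states[k]] for k in time_steps)
        avg = Fraction(total, len(time_steps))
        return avg if i == 1 else -avg
    # Time does not diverge: player i is blameless iff it always waits.
    blameless = all(pair[i - 1] == D1 for pair in cycle_moves)
    return math.inf if blameless else -math.inf
```

For example, if player 1 stutters forever while player 2 keeps proposing to wait, the run is worth `-inf` to player 1 and `+inf` to player 2.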

We say that a state $s \in S$ is well-formed if, for all $i \in \{1,2\}$, we have $v_i(s) > -\infty$.
From (1) and (2), a state is well-formed if both players can ensure that time progresses
from that state, unless blocked by the other player: this is the same notion of
well-formedness introduced in [dAHS02,dAFH+03]. Since we desire games where time
progresses, we consider only games consisting of well-formed states.

3.2  Determinacy 

A game is determined if, for all $s \in S$, we have $v_1(s) + v_2(s) = 0$: this means that if player
$i \in \{1,2\}$ cannot enforce a reward $c \in \mathbb{R}$, then player $\sim i$ can enforce at least reward $-c$.
The following theorem provides a strong non-determinacy result for average-reward
discrete-time games.
Fig. 2. A game automaton. Unspecified guards and invariants are “true”. [Diagram not recoverable; the surviving labels are the reward rates $r = -c$ and $r = +c$.]

Theorem 1. (non-determinacy) For all $c > 0$, there exists a game structure
$\mathcal{G} = (S, Acts_1, Acts_2, \Gamma_1, \Gamma_2, \delta, r)$ with a state $s \in S$, and two “spoiling” strategies $\pi_1^* \in \Pi_1$,
$\pi_2^* \in \Pi_2$, such that the following holds:

$$\sup_{\pi_1 \in \Pi_1} \sup\{w_1(\sigma) \mid \sigma \in Outcomes(s,\pi_1,\pi_2^*)\} \le -c$$

$$\sup_{\pi_2 \in \Pi_2} \sup\{w_2(\sigma) \mid \sigma \in Outcomes(s,\pi_1^*,\pi_2)\} \le -c.$$

As a consequence, $v_1(s) \le -c$ and $v_2(s) \le -c$.

Note that in the theorem we take sup, rather than inf as in (2), over the set of outcomes
arising from the strategies. Hence, the theorem states that even if the choice among
actions is resolved in favor of the player trying to achieve the value, there is a game
with a state $s$ where $v_1(s) + v_2(s) \le -2c < 0$. Moreover, in the theorem, the adversary
strategies are fixed, again providing an advantage to the player trying to achieve the
value.

Proof. Consider the game of Figure 2. We take for $\pi_1^* \in \Pi_1$ and $\pi_2^* \in \Pi_2$ the strategies
that always play $\Delta_0$ in $q_0$, and $\Delta_1$ elsewhere. Let $s_0 = (q_0, [x := 0])$, and consider the
value

$$\hat{v}_1(s_0) = \sup_{\pi_1 \in \Pi_1} \sup\{w_1(\sigma) \mid \sigma \in Outcomes(s_0,\pi_1,\pi_2^*)\}.$$

There are two cases. If eventually player 1 plays forever $\Delta_0$ in $s_0$, player 1 obtains the
value $-\infty$, as time does not progress, and player 1 is not blameless. If player 1, whenever
at $s_0$, eventually plays $a^1$, then the value of the game to player 1 is $-c$. Hence, we have
$\hat{v}_1(s_0) = -c$. The analysis for player 2 is symmetrical. □

The example of Figure 2, together with the above analysis, indicates that we cannot
define the value of an average reward discrete-time game in a way that is symmetrical,
leads to determinacy, and enforces time progress. In fact, consider again the case in
which player 2 always plays $\Delta_0$ at $s_0$. If, beyond some point, player 1 plays forever $\Delta_0$
in $s_0$, time does not progress, and the situation is symmetrical with respect to players 1 and 2: they
both play forever $\Delta_0$. Hence, we must rule out this combination of strategies (either
by assigning value $-\infty$ to the outcome, as we do, or by some other device). Once this
is ruled out, the other possibility is that player 1, whenever in $s_0$, eventually plays $a^1$.
In this case, time diverges, and the average value to player 1 is $-c$. As the analysis is
symmetrical, the value to both players is $-c$, contradicting determinacy.
4  Solution  of  Average  Reward  Timed  Games

In this section, we solve the problem of computing the value of an average reward
timed game with respect to both players. First, we define a turn-based version of the
timed game. This version is equivalent to the original game when one is concerned with
the value achieved by a specific player. Then, following [EM79], we define a finite
game and we prove that it has the same value as the turn-based infinite game. This will
lead to a PSPACE algorithm for computing the value of the game. We then show that the
finite and, consequently, the infinite game admit memoryless optimal strategies for both
players; as mentioned in the introduction, this will enable us to show that the problem
of computing the value of the game is in NP $\cap$ coNP.

In the remainder of this section, we consider a fixed discrete-time game structure
$\mathcal{G} = (S, Acts_1, Acts_2, \Gamma_1, \Gamma_2, \delta, r)$, and we assume that all states are well-formed. We
focus on the problem of computing $v_1(s)$, as the problem of computing $v_2(s)$ is
symmetrical. For a finite run $\sigma$ and a finite or infinite run $\sigma'$ such that $last(\sigma) = first(\sigma')$, we
denote by $\sigma \cdot \sigma'$ their concatenation, where the common state is included only once.


4.1  Turn-based  Timed  Game 


We describe a turn-based version of the timed game, where at each round player 1
chooses his move before player 2. Player 2 can thus use her knowledge of player 1’s
move to choose her own. Moreover, when both players choose an action, the action
chosen by player 2 is carried out. This accounts for the fact that in the definition of
$v_1(s)$, nondeterminism is resolved in favor of player 2 (see (2)). Notice that if player 2
prefers to carry out the action chosen by player 1, she can reply with the stuttering move
$\Delta_0$. Definitions pertaining to this game have a “$t\infty$” superscript that stands for “turn-based
infinite”. We define the turn-based joint destination function $\delta^t : S \times M_1 \times M_2 \mapsto S$ by

$$\delta^t(s,a^1,a^2) = \begin{cases} \delta(s,\Delta_1) & \text{if } a^1 = a^2 = \Delta_1 \\ \delta(s,\Delta_0) & \text{if } \{a^1,a^2\} \subseteq \{\Delta_0,\Delta_1\} \text{ and } a^1 = \Delta_0 \text{ or } a^2 = \Delta_0 \\ \delta(s,a^1) & \text{if } a^1 \in Acts_1 \text{ and } a^2 \in \{\Delta_0,\Delta_1\} \\ \delta(s,a^2) & \text{if } a^2 \in Acts_2 \end{cases}$$
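
The four cases of the turn-based joint destination function can be sketched as follows. The function name `turn_based_delta`, the dictionary encoding of $\delta$, and the strings `"D0"` and `"D1"` standing for the timed moves are assumptions made for the example.

```python
# Hypothetical stand-ins for the timed moves Delta_0 and Delta_1.
D0, D1 = "D0", "D1"


def turn_based_delta(delta, acts1, acts2, s, a1, a2):
    """Unique successor of s under the joint move (a1, a2) in the turn-based game.

    delta maps (state, move) pairs to states; acts1 and acts2 are the
    action sets of player 1 and player 2.
    """
    if a2 in acts2:
        return delta[(s, a2)]  # player 2's action always wins
    if a1 in acts1:
        return delta[(s, a1)]  # player 2 played a timed move
    if a1 == D0 or a2 == D0:
        return delta[(s, D0)]  # stutter step
    return delta[(s, D1)]      # both players wait one time unit
```

Unlike the concurrent game, the result here is a single state, which is why a pair of strategies yields a unique run.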


As before, a run is an infinite sequence $s_0, (a_1^1,a_1^2), s_1, (a_2^1,a_2^2), s_2, \ldots$ such that $s_k \in S$,
$a^1_{k+1} \in \Gamma_1(s_k)$, $a^2_{k+1} \in \Gamma_2(s_k)$, and $s_{k+1} = \delta^t(s_k,a^1_{k+1},a^2_{k+1})$ for all $k \ge 0$. A 1-run
is a finite prefix of a run ending in a state $s$, while a 2-run is a finite prefix of a
run ending in a move $a \in M_1$. For a 2-run $\sigma = s_0, (a_1^1,a_1^2), s_1, \ldots, s_n, (a^1_{n+1})$, we set
$last(\sigma) = s_n$ and $lasta(\sigma) = a^1_{n+1}$.
For $i \in \{1,2\}$, we denote by $FRuns_i$ the set of all $i$-runs. Intuitively, $i$-runs are runs where
it is player $i$’s turn to move. In the turn-based game, a strategy $\pi_i$ for player $i \in \{1,2\}$
is a mapping $\pi_i : FRuns_i \mapsto M_i$ such that $\pi_i(\sigma) \in \Gamma_i(last(\sigma))$ for all $\sigma \in FRuns_i$. For
$i \in \{1,2\}$, let $\Pi_i^t$ denote the set of all player $i$ strategies; notice that $\Pi_1^t = \Pi_1$.
Player-1 memoryless strategies are defined as usual. We say that a player-2 strategy $\pi \in \Pi_2^t$
is memoryless iff, for all $\sigma, \sigma' \in FRuns_2$, $last(\sigma) = last(\sigma')$ and $lasta(\sigma) = lasta(\sigma')$
imply $\pi(\sigma) = \pi(\sigma')$.

For strategies $\pi_1 \in \Pi_1^t$ and $\pi_2 \in \Pi_2^t$, we say that a run
$s_0, (a_1^1, a_1^2), s_1, \ldots$ is consistent with $\pi_1$ and $\pi_2$ if, for all $n \ge 0$,
we have $\pi_1(s_0, (a_1^1, a_1^2), s_1, \ldots, s_n) = a_{n+1}^1$
and $\pi_2(s_0, (a_1^1, a_1^2), s_1, \ldots, s_n, (a_{n+1}^1)) = a_{n+1}^2$. Since $\delta^t$ is
deterministic, for all $s \in S$, there is a unique run that starts in $s$ and is consistent
with $\pi_1$ and $\pi_2$. We denote this run by $\mathit{outcomes}^{t\infty}(s, \pi_1, \pi_2)$.
The values assigned to a run, to a strategy, and to the whole game are defined as follows.
We set $w_1^{t\infty}(\sigma) = w_1(\sigma)$, and

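$$
v_1^{t\infty}(s, \pi_1) = \inf_{\pi_2 \in \Pi_2^t} w_1^{t\infty}(\mathit{outcomes}^{t\infty}(s, \pi_1, \pi_2)); \qquad
v_1^{t\infty}(s) = \sup_{\pi_1 \in \Pi_1^t} v_1^{t\infty}(s, \pi_1).
$$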

The following theorem follows from the definition of the turn-based game and from (2).

Theorem 2. For all $s \in S$, it holds that $v_1(s) = v_1^{t\infty}(s)$.


4.2  Turn-based  Finite  Game 

We now define a finite turn-based game that can be played on a discrete-time game
structure. Definitions pertaining to this game have a "$tf$" superscript that stands for
"turn-based finite". The finite game ends as soon as a loop is closed. A maximal run in the
finite game is a 1-run $\sigma = s_0, (a_1^1, a_1^2), s_1, \ldots, s_n$ such that $s_n$ is the
first state that is repeated in $\sigma$. Formally, $n$ is the least number such that
$s_n = s_j$, for some $j < n$. We set $\mathit{loop}(\sigma)$ to be the suffix of $\sigma$:
$s_j, (a_{j+1}^1, a_{j+1}^2), \ldots, s_n$. For $\pi_1 \in \Pi_1^t$, $\pi_2 \in \Pi_2^t$, and
$s \in S$, we denote by $\mathit{outcomes}^{tf}(s, \pi_1, \pi_2)$ the unique maximal run that
starts in $s$ and is consistent with $\pi_1$ and $\pi_2$.

In the finite game, a maximal run $\sigma$ ending with the loop $\lambda$ is assigned the value
of the infinite run obtained by repeating $\lambda$ forever. Formally,
$w_1^{tf}(\sigma) = w_1(\sigma \cdot \lambda^\omega)$, where $\lambda^\omega$ denotes the
concatenation of countably many copies of $\lambda$. The value assigned to a strategy
$\pi_1 \in \Pi_1^t$ and the value assigned to the whole game are defined as follows.

$$
v_1^{tf}(s, \pi_1) = \inf_{\pi_2 \in \Pi_2^t} w_1^{tf}(\mathit{outcomes}^{tf}(s, \pi_1, \pi_2)); \qquad
v_1^{tf}(s) = \sup_{\pi_1 \in \Pi_1^t} v_1^{tf}(s, \pi_1).
$$


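Notice that since this game is finite and turn-based, for all $s \in S$, it holds:

$$
\sup_{\pi_1 \in \Pi_1^t} \inf_{\pi_2 \in \Pi_2^t} w_1^{tf}(\mathit{outcomes}^{tf}(s, \pi_1, \pi_2))
= \inf_{\pi_2 \in \Pi_2^t} \sup_{\pi_1 \in \Pi_1^t} w_1^{tf}(\mathit{outcomes}^{tf}(s, \pi_1, \pi_2)).
\tag{3}
$$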


4.3  Mapping  Strategies 

We introduce definitions that allow us to relate the finite game to the infinite one. For a
1-run $\sigma = s_0, (a_1^1, a_1^2), s_1, \ldots, s_n$, let $\mathit{firstloop}(\sigma)$ be the
operator that returns the first simple loop (if any) occurring in $\sigma$. Similarly, let
$\mathit{loopcut}(\sigma)$ be the operator that removes the first simple loop (if any) from
$\sigma$. Formally, if $\sigma$ is a simple run (i.e., it contains no loops) we set
$\mathit{firstloop}(\sigma) = \varepsilon$ (the empty sequence), and
$\mathit{loopcut}(\sigma) = \sigma$. Otherwise, let $k \ge 0$ be the smallest number such that
$\sigma_j = \sigma_k$, for some $j < k$; we set

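$$
\begin{aligned}
\mathit{firstloop}(\sigma) &= \sigma_j, (a_{j+1}^1, a_{j+1}^2), \ldots, (a_k^1, a_k^2), \sigma_k; \\
\mathit{loopcut}(\sigma) &= \sigma_0, (a_1^1, a_1^2), \ldots, \sigma_j, (a_{k+1}^1, a_{k+1}^2), \ldots, \sigma_n.
\end{aligned}
$$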


We now define the quasi-segmentation $\mathit{QSeg}(\sigma)$ to be the sequence of simple loops
obtained by applying $\mathit{firstloop}$ repeatedly to $\sigma$.


$$
\mathit{QSeg}(\sigma) =
\begin{cases}
\varepsilon & \text{if } \mathit{firstloop}(\sigma) = \varepsilon \\
\mathit{firstloop}(\sigma) \cdot \mathit{QSeg}(\mathit{loopcut}(\sigma)) & \text{otherwise}
\end{cases}
$$

Fig. 3. Nodes linked by dashed lines represent the same state of the game.


For an infinite run $\sigma$, we set $\mathit{QSeg}(\sigma) = \lim_{n \to \infty} \mathit{QSeg}(\sigma_{\le n})$.
Given a finite run $\sigma$, $\mathit{loopcut}$ can only be applied a finite number of times
before it converges to a fixpoint. We call this fixpoint $\mathit{resid}(\sigma)$. Notice that
for all runs $\sigma$, $\mathit{resid}(\sigma)$ is a simple path and therefore its length is
bounded by $|S|$.
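
The operators above are purely combinatorial, so they can be illustrated on plain lists. The Python sketch below is not part of the original development: it assumes a 1-run is stored as a list `states` and a list `moves`, where `moves[i]` is the move pair leading from `states[i]` to `states[i + 1]`, and all function names are chosen here for illustration.

```python
def first_repeat(states):
    """Return (j, k), k being the least index with states[k] == states[j], j < k."""
    seen = {}
    for k, s in enumerate(states):
        if s in seen:
            return seen[s], k
        seen[s] = k
    return None  # the run is simple


def firstloop(states, moves):
    """First simple loop of the run, or None if the run is simple."""
    hit = first_repeat(states)
    if hit is None:
        return None
    j, k = hit
    return states[j:k + 1], moves[j:k]


def loopcut(states, moves):
    """The run with its first simple loop removed (unchanged if simple)."""
    hit = first_repeat(states)
    if hit is None:
        return states, moves
    j, k = hit
    return states[:j + 1] + states[k + 1:], moves[:j] + moves[k:]


def qseg(states, moves):
    """Sequence of simple loops obtained by applying firstloop repeatedly."""
    loops = []
    loop = firstloop(states, moves)
    while loop is not None:
        loops.append(loop)
        states, moves = loopcut(states, moves)
        loop = firstloop(states, moves)
    return loops


def resid(states, moves):
    """Fixpoint of loopcut: a simple path."""
    while first_repeat(states) is not None:
        states, moves = loopcut(states, moves)
    return states, moves
```

For instance, on the state sequence a, b, c, b, d, a, e the first loop is b, c, b; after cutting it, the next loop is a, b, d, a, and the residual is the simple path a, e.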

For simplicity, we developed the above definitions for 1-runs. The corresponding
definitions of $\mathit{resid}(\sigma)$ and $\mathit{QSeg}(\sigma)$ for 2-runs $\sigma$ are similar.

For all $i \in \{1,2\}$ and all strategies $\pi \in \Pi_i^t$, we define the strategy
$\tilde{\pi}$ as $\tilde{\pi}(\sigma) = \pi(\mathit{resid}(\sigma))$ for all
$\sigma \in \mathit{FRuns}_i$. Intuitively, $\tilde{\pi}$ behaves like $\pi$ until a loop is
formed. At that point, $\tilde{\pi}$ forgets the loop, behaving as if the whole loop had not
occurred. We now give some technical lemmas.

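Lemma 1. Let $\pi_1 \in \Pi_1^t$, $\pi_2 \in \Pi_2^t$, and
$\sigma = \mathit{outcomes}^{t\infty}(s, \tilde{\pi}_1, \pi_2)$. For all $k \ge 0$,
$\mathit{resid}(\sigma_{\le k})$ is a prefix of a finite run consistent with $\pi_1$. Formally,
there is $\pi_2' \in \Pi_2^t$ and $\sigma' = \mathit{outcomes}^{tf}(s, \pi_1, \pi_2')$ such that
$\sigma' = \mathit{resid}(\sigma_{\le k}) \cdot \rho$.

Similarly, let $\sigma = \mathit{outcomes}^{t\infty}(s, \pi_1, \tilde{\pi}_2)$. For all
$k \ge 0$, there is $\pi_1' \in \Pi_1^t$ and $\sigma' = \mathit{outcomes}^{tf}(s, \pi_1', \pi_2)$
such that $\sigma' = \mathit{resid}(\sigma_{\le k}) \cdot \rho$.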

Proof. We prove the first statement, as the second one is analogous. We proceed by
induction on the length of $\mathit{QSeg}(\sigma_{\le k})$. If $\mathit{QSeg}(\sigma_{\le k})$ is
the empty sequence (i.e., $\sigma_{\le k}$ contains no loops), the result is easily obtained,
as $\tilde{\pi}_1$ coincides with $\pi_1$ until a loop is formed. So, we can take
$\pi_2' = \pi_2$ and obtain the conclusion.

On the other hand, suppose $\mathit{QSeg}(\sigma_{\le k}) = \lambda_1, \ldots, \lambda_n$. For
simplicity, suppose $\lambda_1 \neq \lambda_2$. As illustrated in Figure 3, let $\sigma_j$ be the
first state after $\lambda_1$ that does not belong to $\lambda_1$. Then, $\sigma_{j-1}$ belongs to
$\lambda_1$ and there is another index $i < j - 1$ such that $\sigma_i = \sigma_{j-1}$. So,
the game went twice through $\sigma_{j-1}$ and two different successors were taken. However,
player 1 must have chosen the same move in $\sigma_i$ and $\sigma_{j-1}$, as by construction
$\tilde{\pi}_1(\sigma_{\le i}) = \tilde{\pi}_1(\sigma_{\le j-1})$. Therefore, the change must be
due to a different choice of $\pi_2$. It is easy to devise $\pi_2'$ that coincides with
$\pi_2$, except that $\lambda_1$ may be skipped, and at $\sigma_i$, the successor $\sigma_j$ is
chosen. We can then obtain a run $\rho = \mathit{outcomes}^{t\infty}(s, \tilde{\pi}_1, \pi_2')$
and an integer $k' \ge 0$ such that $\mathit{QSeg}(\rho_{\le k'}) = \lambda_2, \ldots, \lambda_n$
and $\mathit{resid}(\rho_{\le k'}) = \mathit{resid}(\sigma_{\le k})$. The thesis is obtained by
applying the inductive hypothesis to $\rho$ and $k'$. □

Using this lemma, we can show that for all $\pi_1 \in \Pi_1^t$, each loop occurring in the
infinite game under $\tilde{\pi}_1$ can also occur in the finite game under $\pi_1$.

Lemma 2. Let $\pi_1 \in \Pi_1^t$, $\pi_2 \in \Pi_2^t$, and
$\sigma = \mathit{outcomes}^{t\infty}(s, \tilde{\pi}_1, \pi_2)$. For all
$\lambda \in \mathit{QSeg}(\sigma)$, $\lambda$ can occur as the final loop in a maximal run of the
finite game. Formally, there is $\pi_2' \in \Pi_2^t$ and
$\sigma' = \mathit{outcomes}^{tf}(s, \pi_1, \pi_2')$ such that $\lambda = \mathit{loop}(\sigma')$.

Similarly, let $\sigma = \mathit{outcomes}^{t\infty}(s, \pi_1, \tilde{\pi}_2)$. For all
$\lambda \in \mathit{QSeg}(\sigma)$, there is $\pi_1' \in \Pi_1^t$ and
$\sigma' = \mathit{outcomes}^{tf}(s, \pi_1', \pi_2)$ such that $\lambda = \mathit{loop}(\sigma')$.

The next lemma states that if the strategy $\pi_1$ of player 1 achieves value $v$ in the
finite turn-based game, the strategy $\tilde{\pi}_1$ achieves at least as much in the infinite
turn-based game.

Lemma 3. For all $s \in S$ and $\pi_1 \in \Pi_1^t$, it holds that
$v_1^{t\infty}(s, \tilde{\pi}_1) \ge v_1^{tf}(s, \pi_1)$.

Proof. Let $v = v_1^{tf}(s, \pi_1)$. We show that $\tilde{\pi}_1$ can ensure reward $v$ in the
infinite game. The result is trivially true if $v = -\infty$. So, in the following we assume
that $v > -\infty$.

Fix a player-2 strategy $\pi_2 \in \Pi_2^t$ and let
$\sigma = \mathit{outcomes}^{t\infty}(s, \tilde{\pi}_1, \pi_2)$. Let
$\mathit{QSeg}(\sigma) = \lambda_1, \lambda_2, \ldots$ We distinguish two cases, according to
whether time diverges or not in $\sigma$. If time diverges, all loops $\lambda_j$ that contain
no tick give no contribution to the value of $\sigma$ and can therefore be ignored.

For all $\lambda_j$ containing (at least) a time step, by Lemma 2, $\lambda_j$ is a possible
terminating loop for the finite game under $\pi_1$. Thus, $R(\lambda_j) \ge v \cdot D(\lambda_j)$.
Now, the value of $\sigma$ can be split as the value due to loops containing time steps, plus
the value due to the residual. For all $n \ge 0$, let $m_n$ be the number of loops in
$\mathit{QSeg}(\sigma_{\le n})$. We obtain:


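$$
w_1^{t\infty}(\sigma)
= \liminf_{n \to \infty} \frac{R(\sigma_{\le n})}{D(\sigma_{\le n})}
= \liminf_{n \to \infty} \frac{R(\mathit{resid}(\sigma_{\le n})) + \sum_{j=1}^{m_n} R(\lambda_j)}{D(\mathit{resid}(\sigma_{\le n})) + \sum_{j=1}^{m_n} D(\lambda_j)}
= \liminf_{n \to \infty} \frac{\sum_{j=1}^{m_n} R(\lambda_j)}{\sum_{j=1}^{m_n} D(\lambda_j)}
\ge v.
$$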


Consider now the case when $\sigma$ contains only finitely many time steps. Let $k \ge 0$ be
such that no time steps occur in $\sigma$ after $\sigma_k$. Consider a loop $\lambda_j$ entirely
occurring after $\sigma_k$. Obviously $\lambda_j$ contains no time steps. Moreover, by Lemma 2,
$\lambda_j$ is a terminating loop for a maximal run $\rho$ in the finite game under $\pi_1$.
Since $v_1^{tf}(s, \pi_1) > -\infty$, it must be $w_1^{tf}(\rho) = +\infty$. Consequently, it
holds $\mathit{blameless}^1(\rho)$ and in particular player 1 is blameless in all edges in
$\lambda_j$.

Now, let $k' \ge 0$ be such that each state (and edge) after $\sigma_{k'}$ will eventually be
part of a loop of $\mathit{QSeg}(\sigma)$. Let $k'' = \max\{k, k'\}$. Then, all edges that occur
after $k''$ will eventually be part of a loop where player 1 is blameless. Consequently, $k''$
is a witness to the fact that $\mathit{blameless}^1(\sigma)$, and therefore
$w_1^{t\infty}(\sigma) = +\infty \ge v$. □

Lemma 4. For all $s \in S$ and $\pi_2 \in \Pi_2^t$, it holds that
$v_1^{t\infty}(s, \tilde{\pi}_2) \le v_1^{tf}(s, \pi_2)$.

Proof. Let $v = v_1^{tf}(s, \pi_2)$. Similarly to Lemma 3, we can rule out the case
$v = +\infty$ as trivial. Fix a player-1 strategy $\pi_1$, and let
$\sigma = \mathit{outcomes}^{t\infty}(s, \pi_1, \tilde{\pi}_2)$. We show that
$w_1^{t\infty}(\sigma) \le v$. If time diverges on $\sigma$, the proof is similar to the
analogous case in Lemma 3. Otherwise, let $k \ge 0$ be such that no time steps occur in
$\sigma$ after $\sigma_k$. Consider a loop $\lambda \in \mathit{QSeg}(\sigma)$, entirely
occurring after $\sigma_k$. Obviously $\lambda$ contains no time steps. Moreover, by Lemma 2,
$\lambda$ is a terminating loop for a maximal run $\rho$ in the finite game under $\pi_2$.
Since $v_1^{tf}(s, \pi_2) < +\infty$, it must be $w_1^{tf}(\rho) = -\infty$. Consequently, it
holds $\neg\mathit{blameless}^1(\rho)$ and in particular player 1 is blamed in some edge of
$\lambda$. This shows that $\neg\mathit{blameless}^1(\sigma)$, and consequently
$w_1^{t\infty}(\sigma) = -\infty \le v$. □


Lemmas 3 and 4 show that the infinite game is no harder than the finite one, for both
players. Considering also (3), we obtain the following result.

Theorem 3. For all $s \in S$, $v_1^{t\infty}(s) = v_1^{tf}(s)$.

Theorems 2 and 3 allow us to use the finite game to compute the value of the original
timed game. The length of the finite game is bounded by $|S|$. It is well known that a
recursive, backtracking algorithm can compute the value of such a game in PSPACE.

Theorem 4. For all $s \in S$, $v_1(s)$ can be computed in PSPACE.
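
A recursive backtracking evaluation of the finite game can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the game is passed as a hypothetical tuple `(gamma1, gamma2, delta_t, rate, acts1)`, and the value of a closed loop is assumed to be the average reward per tick if the loop contains a tick, $+\infty$ if it contains no tick and never blames player 1, and $-\infty$ otherwise, which is how loop values are used in the proofs above. The recursion depth never exceeds the number of states.

```python
import math
from fractions import Fraction

DELTA0, DELTA1 = "Delta0", "Delta1"  # stutter and tick moves (names assumed)


def loop_value(edges, rate, acts1):
    """Value to player 1 of repeating the loop `edges` forever.

    Each edge is a triple (state, a1, a2). Assumption: a tick edge leaving
    state s lasts one time unit and yields reward rate[s].
    """
    ticks = [s for (s, a1, a2) in edges if a1 == DELTA1 and a2 == DELTA1]
    if ticks:
        return Fraction(sum(rate[s] for s in ticks), len(ticks))
    # Zero-time loop: player 1 is blamed if it played an action or Delta0.
    blamed = any(a1 in acts1 or a1 == DELTA0 for (_, a1, _) in edges)
    return -math.inf if blamed else math.inf


def finite_game_value(s, game, states=(), edges=()):
    """sup over player-1 moves of inf over player-2 replies, by backtracking."""
    gamma1, gamma2, delta_t, rate, acts1 = game
    states = states + (s,)
    best = -math.inf
    for a1 in gamma1(s):
        worst = math.inf
        for a2 in gamma2(s):  # player 2 replies knowing a1
            t = delta_t(s, a1, a2)
            path = edges + ((s, a1, a2),)
            if t in states:
                # A loop is closed: the play stops and the loop is evaluated.
                val = loop_value(path[states.index(t):], rate, acts1)
            else:
                val = finite_game_value(t, game, states, path)
            worst = min(worst, val)
        best = max(best, worst)
    return best


# Toy structure (entirely hypothetical): "go" moves from a to b, ticks stay put.
acts1 = {"go"}
rate = {"a": 1, "b": 3}
moves1 = {"a": ["go", DELTA1], "b": [DELTA1]}
moves2 = {"a": [DELTA1], "b": [DELTA1]}


def toy_delta_t(s, a1, a2):
    return "b" if (s, a1) == ("a", "go") else s


game = (moves1.__getitem__, moves2.__getitem__, toy_delta_t, rate, acts1)
```

In the toy structure, player 1 does better by moving to the state with the higher reward rate before letting time pass, so `finite_game_value("a", game)` evaluates the loop at `b` rather than the loop at `a`.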

4.4  Memory 

By following the "forgetful game" construction and proofs used by [EM79], we can
derive a similar result on the existence of memoryless strategies for both players. The
proof depends on the fact that the value of the forgetful game is the same as that of the
turn-based finite game (and hence, the same as that of the infinite game, from Theorem 3),
and follows the same inductive steps as provided in [EM79].

Theorem 5. For all $i \in \{1,2\}$ and $t \in S$, there exists a memoryless optimal strategy
for player $i$. Formally, there exists $\pi_i \in \Pi_i$ such that $v_1(t, \pi_i) = v_1(t)$.

4.5  Improved  Algorithms 

We show that, given $s \in S$, $v \in \mathbb{Q}$ and $i \in \{1,2\}$, the problem of checking
whether $v_i^{tf}(s) \ge v$ is in NP ∩ coNP. The decision problem $v_1^{tf}(s) \ge v$ is in NP
because a memoryless strategy for player 1 acts as a polynomial-time witness: once such a
strategy $\pi_1$ is fixed, we can compute in polynomial time the value $v_1^{tf}(s, \pi_1)$.
The problem is also in coNP because, once a memoryless strategy of player 2 is fixed, we can
compute in polynomial time the value $v_1^{tf}(s, \pi_2)$.

Once we fix a memoryless strategy for player $i \in \{1,2\}$, the finite game is reduced
to a multigraph where all the choices belong to the other player. It is convenient to define
the set of vertices of the multigraph as $U = \{\{s\} \mid s \in S\}$, rather than simply as
$S$. Let $E$ be the set of edges of the multigraph. Each edge $e \in E$ is labeled with the
pair of moves $(a^1, a^2) \in M_1 \times M_2$ played by the players along $e$. We label $e$
with $\mathit{tick}$ whenever $a^1 = a^2 = \Delta_1$, and with $\mathit{bl}_i$ whenever
$a^i \in \mathit{Acts}_i \cup \{\Delta_0\}$; every edge $e$ from $\{s\}$ to $\{t\}$ is also
associated with reward $r(s)$ if it has label $\mathit{tick}$, and reward 0 otherwise. We
indicate paths in this graph by $u_0, e_1, u_1, e_2, \ldots, u_n$, where $e_i$ is an edge from
$u_{i-1}$ to $u_i$, for $1 \le i \le n$. Given a strongly connected component (SCC) $(V, F)$,
where $V \subseteq U$ and $F \subseteq E$, we collapse $(V, F)$ as follows: (i) we replace in
$U$ the vertices in $V$ by the single vertex $\bigcup V$; (ii) we remove all edges in $F$;
(iii) we replace every edge from $v \in V$ to $u \in U \setminus V$ (resp. from
$u \in U \setminus V$ to $v \in V$) with an edge of the same label from $\bigcup V$ to $u$
(resp. from $u$ to $\bigcup V$); (iv) we replace every edge $e \notin F$ from $v \in V$ to
$v' \in V$ with a self-loop of the same label from $\bigcup V$ to $\bigcup V$.
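
The collapsing operation (i)-(iv) can be sketched in a few lines of Python. The representation is an assumption made for this illustration: vertices are frozensets of states, edges are `(src, dst, label)` triples stored in a list so that parallel edges stay distinct, and `F` is given as a set of positions in that list.

```python
def collapse(vertices, edges, V, F):
    """Collapse the SCC (V, F) into the single vertex "union of V".

    vertices: set of frozensets of states
    edges:    list of (src, dst, label); list position identifies an edge
    V:        set of vertices to merge
    F:        set of indices into `edges` that belong to the SCC
    """
    merged = frozenset().union(*V)                # (i) the new vertex
    new_vertices = (vertices - V) | {merged}
    new_edges = []
    for idx, (src, dst, label) in enumerate(edges):
        if idx in F:                              # (ii) SCC edges are removed
            continue
        src = merged if src in V else src         # (iii) redirect boundary edges
        dst = merged if dst in V else dst         # (iv) inner edges become self-loops
        new_edges.append((src, dst, label))
    return new_vertices, new_edges
```

Edges inside $V$ that are not in $F$ (for example a tick edge between two merged vertices) survive as self-loops, which matters later because they may still carry reward.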

To determine the value of this multigraph to player 1, we first transform the multigraph
so that all edges are labeled with $\mathit{tick}$, and we then apply Karp's algorithm for
computing the loop with minimum or maximum average reward [Kar78]. We proceed
depending on whether player 1, or player 2, fixes a memoryless strategy. When player 1
fixes a memoryless strategy:

1. Find a maximal SCC $(V, F)$, where $V \subseteq U$ and $F \subseteq E$, such that all edges
in $F$ are labeled with $\neg\mathit{tick}$ and $\neg\mathit{bl}_1$. Player 2 will want to avoid
following this SCC forever; thus, we collapse it. Repeat until no more SCCs can be collapsed.

2. If a vertex $u \in U$ has no outgoing edges, it means that player 2 could not avoid
entering and following one of the SCCs collapsed above. Hence, for each $u \in U$
without outgoing edges, remove $u$ from the graph along with all incoming edges,
and assign value $+\infty$ to all $s \in u$. Repeat until no more vertices can be removed.

3. Find all the loops whose edges are all labeled with $\neg\mathit{tick}$. Due to the
collapsing in the above steps, each of these loops contains at least one edge labeled
$\mathit{bl}_1$, so its value when followed forever is $-\infty$. Remove all vertices on such
loops from the graph, and assign value $-\infty$ to the corresponding states.

4. From the resulting multigraph $G$, construct a multigraph $G'$ with the same vertices
as $G$. For each simple path in $G$ of the form $u_0, e_1, u_1, \ldots, u_n, e_{n+1}, u_{n+1}$
where the edges $e_1, \ldots, e_n$ are labeled by $\neg\mathit{tick}$, and the edge $e_{n+1}$ is
labeled by $\mathit{tick}$, we insert in $G'$ an edge from $u_0$ to $u_{n+1}$ labeled by the
same reward as $e_{n+1}$.

5. Use the algorithm of [Kar78] to find the loop with minimal average reward in $G'$
(the algorithm of [Kar78] is phrased for graphs, but it can be trivially adapted to
multigraphs). If $r$ is the average reward of the loop thus found, all the vertices of
the loop, and all the vertices that can reach the loop, have value $r$. Remove them
from $G'$, and assign value $r$ to the corresponding states. Repeat this step until all
vertices have been removed.
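
Step 5 relies on computing a loop with minimum average reward. The sketch below shows the standard dynamic program usually attributed to [Kar78], written for multigraphs; it returns only the optimal mean, not the loop itself, and the function names and the integer vertex encoding are assumptions of this illustration. Starting every table row from all vertices at once plays the role of a zero-weight super-source.

```python
import math
from fractions import Fraction


def min_mean_cycle(n, edges):
    """Minimum average edge weight over all cycles, or None if acyclic.

    Vertices are 0..n-1; edges is a list of (u, v, w) triples and may contain
    parallel edges and self-loops. Weights should be ints or Fractions.
    """
    # d[k][v]: least weight of a walk with exactly k edges that ends in v.
    d = [[math.inf] * n for _ in range(n + 1)]
    d[0] = [0] * n
    for k in range(1, n + 1):
        for u, v, w in edges:
            if d[k - 1][u] != math.inf:
                d[k][v] = min(d[k][v], d[k - 1][u] + w)
    best = None
    for v in range(n):
        if d[n][v] == math.inf:
            continue  # no walk of length n ends here
        worst = max(Fraction(d[n][v] - d[k][v], n - k)
                    for k in range(n) if d[k][v] != math.inf)
        best = worst if best is None else min(best, worst)
    return best


def max_mean_cycle(n, edges):
    """Maximum cycle mean, obtained by negating all weights."""
    value = min_mean_cycle(n, [(u, v, -w) for u, v, w in edges])
    return None if value is None else -value
```

Since every edge of $G'$ stands for exactly one time unit, the mean weight of a cycle in $G'$ is its average reward per time unit. The maximizing variant is the one needed when player 2 fixes the strategy.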

Similarly (but not symmetrically), if player 2 fixes a memoryless strategy, we can compute
the value for player 1 as follows:

1. Find all the loops where all the edges are labeled with $\neg\mathit{tick}$ and
$\neg\mathit{bl}_1$. These loops, and all the vertices that can reach them, have value
$+\infty$. Remove them from the graph, and assign value $+\infty$ to the corresponding states.

2. Find a maximal SCC $(V, F)$, where $V \subseteq U$ and $F \subseteq E$, such that all edges
in $F$ are labeled with $\neg\mathit{tick}$. Due to the previous step, every loop in $(V, F)$
contains at least one edge labeled $\mathit{bl}_1$, and player 1 will want to avoid following
such an SCC forever; thus, we collapse $(V, F)$.

3. For each $u \in U$ without outgoing edges, remove $u$ from the graph along with all
incoming edges, and assign value $-\infty$ to all $s \in u$. Repeat until no more vertices
can be removed.

4.  From  the  resulting  multigraph  G,  construct  a  multigraph  G'  as  in  step  4  of  the 
previous  case. 

5. This step is the same as step 5 of the previous case, except that in each iteration we
find the loop with maximal average reward.
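
The construction of $G'$ in step 4 of both procedures can be sketched as below. This is an illustration under assumptions, not the authors' code: edges are hypothetical tuples `(src, dst, is_tick, reward)`, the input is assumed to contain no cycle made only of non-tick edges (which steps 1-3 are meant to ensure), and one edge is emitted per reachable tick edge rather than per path, which leaves the set of cycle means unchanged.

```python
def contract_zero_time(vertices, edges):
    """Return the edges (src, dst, reward) of G' for a multigraph G.

    Each returned edge stands for a path of non-tick edges followed by one
    tick edge, and carries the reward of that tick edge.
    """
    zero_succ = {u: [] for u in vertices}   # successors via non-tick edges
    tick_out = {u: [] for u in vertices}    # outgoing tick edges
    for src, dst, is_tick, reward in edges:
        if is_tick:
            tick_out[src].append((dst, reward))
        else:
            zero_succ[src].append(dst)

    def zero_reachable(u0):
        # Vertices reachable from u0 using non-tick edges only (u0 included).
        seen, stack = {u0}, [u0]
        while stack:
            u = stack.pop()
            for v in zero_succ[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    new_edges = []
    for u0 in vertices:
        for u in zero_reachable(u0):
            for dst, reward in tick_out[u]:
                new_edges.append((u0, dst, reward))
    return new_edges
```

After this contraction every edge represents exactly one time unit, so a mean-cycle computation on the result yields average reward per time unit directly.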

Since the algorithm of [Kar78], as well as the above graph manipulations, can all be
done in polynomial time, we have the following result.

Theorem 6. The problem of computing the value to player $i \in \{1,2\}$ of a discrete-time
average reward game is in NP ∩ coNP.

We note that the maximal reward that a player can accrue in the first $n$ time units cannot
be computed by iterating $n$ times a dynamic-programming operator, as is the case for
untimed games. In fact, each player can play an unbounded number of zero-time moves
in the first $n$ time units, so that even the finite time-horizon version of our games requires
the consideration of time divergence. Hence, it does not seem possible to adapt the
approach of [ZP96] to obtain a weakly-polynomial algorithm. Whether polynomial-time
algorithms can be achieved by other means is an open problem.